Emotional Analysis of Blogs and Forums Data

We perform a statistical analysis of emotionally annotated comments in two large online datasets,
examining chains of consecutive posts in the discussions. Using comparisons with randomised data
we show that there is a high level of correlation for the emotional content of messages.

I. INTRODUCTION

Recent years have produced several well-motivated and carefully described studies addressing
the problem of opinion formation and its spreading. This kind of research usually aimed at
qualitative descriptions of specific phenomena using both numerical and analytical methods,
and touched on problems like culture dissemination [2], decision making [3], majority rule
voting, social impact [5] or community isolation [6]. The bottleneck of such studies is always
the lack of real-world data that could support the presented theories. On the other hand, the
rapid and overwhelming development of the Internet enables gathering information on its users
and their habits, spotting characteristic structures [8] and users' behaviour [9]. However,
none of these works has delved into a crucial aspect of any analysis of new-born media like
Internet blogs or forums: their emotional content. It is only lately that such analyses have
started to emerge [10]-[13]. In this paper we focus on the properties of emotionally annotated
chains of posts from two large online datasets. We give strong evidence that the discussions
cannot be treated as random insertions of comments, by showing various measures of correlation
in the emotional content of posts. The paper is complementary to recent analyses of cluster
formation and the influence of negative emotions on the properties of online discussions.



II. DATA DESCRIPTION 

The aim of this study was to find common properties of comment chains in Internet blogs. The
analysis was performed on two datasets: Blogs and BBC Forums. The BBC web site had a number of
publicly open moderated Message Boards covering a wide variety of topics that allowed
registered users to start their own discussions and post comments on existing discussions. Our
data included discussions posted on the Religion and Ethics and World/UK News message boards
starting from the launch of the website (July 2005 and June 2005 respectively) until June 2009.
The Blogs dataset is a subset of the Blogs06 collection of blog posts from 06/12/2005 to
21/02/2006. Only posts attracting more than 100 comments were extracted, as these apparently
initialised non-trivial discussions. Both datasets have similar structures. They consist of
blog posts and corresponding indexed comments, each possessing two values: a positive
probability $P_{pos}$ and a subjective probability $P_{sub}$ (both are real numbers between 0
and 1), forming a chain of comments $x_1, x_2, \ldots, x_{N-1}, x_N$ ($N$ is the thread length).

These values are the output of a sentiment analysis classifier that is informed by previous
studies on the extraction of emotion from texts [17], [18]. Sentiment analysis algorithms often
operate in stages as follows: (a) separating objective from subjective texts, (b) predicting
the polarity of the subjective texts, and (c) detecting the sentiment target [19]. Our
algorithm used supervised machine-learning principles [20]. For this, we implemented a
hierarchical extension of a standard Language Model (LM) classifier. LM classifiers estimate
the probability that a given document belongs to each class and then select the class with the
highest probability. In our hierarchical extension a document is first classified by the
algorithm as objective or subjective and then, for subjective texts, a second-stage
classification determines the polarity as being either positive or negative. We used a manually
annotated subset of about 34,000 documents from the Blogs06 data set as a training corpus.

The processed datasets have the following key properties: Blogs consists of 1,232 discussions
(threads) with 245,698 comments in total, while the BBC Forums have 97,946 threads with
2,474,781 comments.
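The two-stage classification described above can be illustrated with a minimal sketch. It is not the classifier used in this study: the unigram model, the add-one smoothing, the equal class priors, the toy training sentences and all names (`UnigramLM`, `class_posterior`, `annotate`) are assumptions made only for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Unigram language model with add-one smoothing (an assumed, simplified LM)."""

    def __init__(self, documents):
        self.counts = Counter(w for doc in documents for w in doc.split())
        self.total = sum(self.counts.values())

    def log_prob(self, text, vocab_size):
        # Sum of smoothed log-probabilities of the words in the text.
        return sum(
            math.log((self.counts[w] + 1) / (self.total + vocab_size))
            for w in text.split()
        )


def class_posterior(text, models):
    """Return P(class | text) for each class, assuming equal class priors."""
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    logs = {c: m.log_prob(text, vocab_size) for c, m in models.items()}
    top = max(logs.values())
    weights = {c: math.exp(value - top) for c, value in logs.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


def annotate(text, subjectivity_models, polarity_models):
    """Stage 1: objective vs subjective. Stage 2: positive vs negative."""
    p_sub = class_posterior(text, subjectivity_models)["subjective"]
    p_pos = class_posterior(text, polarity_models)["positive"]
    return p_sub, p_pos


# Toy training data (hypothetical, far smaller than a real annotated corpus).
subjectivity_models = {
    "objective": UnigramLM(["the meeting starts at noon", "the report has ten pages"]),
    "subjective": UnigramLM(["i love this great idea", "i hate this awful plan"]),
}
polarity_models = {
    "positive": UnigramLM(["i love this great idea", "what a wonderful day"]),
    "negative": UnigramLM(["i hate this awful plan", "what a terrible day"]),
}

p_sub, p_pos = annotate("i love this wonderful idea", subjectivity_models, polarity_models)
```

In this sketch each comment receives both values, which mirrors the structure of the annotated data (every comment carries $P_{sub}$ and $P_{pos}$).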
Histograms of the $P_{pos}$ (Fig. 1) and $P_{sub}$ (Fig. 2) distributions were created. In
both cases the distributions can be approximated by a bimodal distribution, with two
dominating histogram bars at the extreme values 0 and 1. Therefore, looking at Fig. 1, one can
say that comments that are very probably positive (later called positive) and comments that
are very probably negative (later called negative) occur in threads in large quantities.

FIG. 1: Histograms of positive probability values $P_{pos}$ for Blogs (upper plot) and BBC
Forum (bottom plot).

FIG. 2: Histograms of subjective probability values $P_{sub}$ for Blogs (upper plot) and BBC
Forum (bottom plot).



IV. MEAN $\langle P_{pos} \rangle$ VALUES IN THREADS

For each thread a mean value $\langle P_{pos} \rangle$ was calculated over all comments
(Fig. 3). As a comparison, statistical predictions were used: every comment in each thread had
its $P_{pos}$ randomised using the $P_{pos}$ distribution (Fig. 1). In both datasets the plots
of the mean value distributions have a similar, Gaussian-like shape with a peak lower than the
random predictions. For both datasets there is a shift toward positive values, which is much
stronger in the case of Blogs, where the peak is centred near $P_{pos} = 1$. This difference
between the data statistics and the statistical predictions of the shuffled data indicates the
presence of strong correlations for positive comments in individual threads in the Blogs data.
These correlations can be effects of mutual affective interactions between each thread's
participants.

Any sequence of neighbouring comments within a thread which satisfy the rule $P_{sub} > T$
forms a subjective cluster. For each threshold $T$ an average subjective cluster size
$\langle S(T) \rangle$ was calculated (Fig. 4), and thread shuffling and global shuffling were
used as references. Here, shuffling is understood as a random reordering of the time series in
order to destroy any existing correlations within the data. It can be performed at the thread
level by randomising the index $i$ of each comment within a thread. Global shuffling is a
process of randomising data within the whole dataset. In the case of the Blogs dataset the
subjective attraction has a large impact on the structure of every comment chain (Fig. 4,
upper plot). In comparison to global shuffling, the clusters in the original data were 26-36 %
greater in size. In comparison to thread shuffling this increase was about 7 %. It seems that
some comment chains have a structure that can only be destroyed by global shuffling, for
example almost all posts being strongly positive or almost all being strongly negative. The
main difference between the two datasets is that the BBC Forums are less clustered (Fig. 4,
bottom plot). This conclusion matches the much weaker subjectivity attraction behaviour of
this dataset. The subjective probability in the $n$-th comment is less likely to induce a
similar value in the next comment, so more drastic shifts are common and the clusters break
more often. As a consequence, the mean cluster size is much smaller.
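The cluster definition and the two shuffling references can be sketched as follows. This is a minimal illustration under stated assumptions: threads are represented as lists of $P_{sub}$ values, global shuffling is assumed to keep the original thread lengths, and all function names are hypothetical.

```python
import random


def cluster_sizes(thread, threshold):
    """Lengths of maximal runs of consecutive comments with P_sub > threshold."""
    sizes, run = [], 0
    for p_sub in thread:
        if p_sub > threshold:
            run += 1
        elif run:
            sizes.append(run)
            run = 0
    if run:
        sizes.append(run)
    return sizes


def mean_cluster_size(threads, threshold):
    """Average cluster size over all threads for one threshold T."""
    sizes = [s for t in threads for s in cluster_sizes(t, threshold)]
    return sum(sizes) / len(sizes) if sizes else 0.0


def thread_shuffle(threads, rng):
    """Randomise the order of comments inside each thread separately."""
    return [rng.sample(t, len(t)) for t in threads]


def global_shuffle(threads, rng):
    """Randomise values over the whole dataset (thread lengths kept, by assumption)."""
    pool = [p for t in threads for p in t]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for t in threads:
        shuffled.append(pool[start:start + len(t)])
        start += len(t)
    return shuffled
```

Comparing `mean_cluster_size` on the original threads with its value on the two shuffled versions, for a range of thresholds, gives the kind of comparison shown in Fig. 4.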
FIG. 3: Average positive probability frequency $f(P_{pos})$ for subjective comments in the
case of Blogs (upper plot) and BBC Forum (bottom plot). Solid lines with squares come from
data and dotted lines are statistical predictions.
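The comparison between observed thread means and the randomised reference can be sketched as below. The sketch assumes that "randomising" a comment means drawing its $P_{pos}$ independently from the empirical distribution of all comments; the function names are hypothetical.

```python
import random
from statistics import mean


def thread_means(threads):
    """Mean P_pos of each thread."""
    return [mean(t) for t in threads]


def randomised_thread_means(threads, rng):
    """Thread means after redrawing every P_pos from the pooled empirical distribution."""
    pool = [p for t in threads for p in t]
    return [mean(rng.choices(pool, k=len(t))) for t in threads]
```

Histograms of the two returned lists correspond to the data curve and the prediction curve, respectively. If threads are internally homogeneous, the observed means spread more widely than the randomised ones.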



V. CORRELATIONS FOR SUBJECTIVE 
PROBABILITIES 

In order to analyse the structure of comment grouping, a probability correlation ratio was
defined

$$C(x_n, x_{n-1}) = \frac{p(x_n \mid x_{n-1})}{p(x_n)} \qquad (1)$$

where $p(x_n \mid x_{n-1})$ is a conditional probability (here $x_n$ stands for $P_{sub}(n)$).
The coefficient measures how the $(n-1)$-th state affects the $n$-th state in comparison to
simply picking the $n$-th state at random. For instance, $C = 2$ would mean that in the
analysed data subset the probability of getting $x_n$ if the previous comment was $x_{n-1}$ is
two times greater than that of picking the value $x_n$ at random. If the dataset were purely
random in nature, all the correlation ratio values would be equal to $C = 1$.

FIG. 4: Average size of the subjective comments cluster $\langle S(T) \rangle$ for a given
threshold $T$ for Blogs (upper plot) and BBC Forums (bottom plot). Circles represent data
without shuffling, triangles come from shuffling at the thread level and squares represent
data from global shuffling.



The correlation ratio was calculated in the form of PMI (Pointwise Mutual Information),
$\mathrm{PMI} = \log C$, for the subjective probability of each pair of consecutive comments
in all threads (Fig. 5). In both datasets the values increase when approaching the diagonal,
i.e. the line $x_{n-1} = x_n$. This trend is very strong in the case of Blogs (Fig. 5, upper
plot), where the diagonal line is very distinct. For the BBC Forums there is also a more
correlated area along the diagonal, which lies in the range $x_n > 0.7$ and $x_{n-1} > 0.7$.
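A minimal sketch of how the correlation ratio $C$ and the PMI could be estimated from binned values is given below. The number of bins, the natural logarithm, and the estimation of $p(x_n)$ from the second element of each consecutive pair are assumptions of this sketch; the function names are hypothetical.

```python
import math
from collections import Counter


def bin_index(value, n_bins=10):
    """Map a probability in [0, 1] to a bin number 0 .. n_bins - 1."""
    return min(int(value * n_bins), n_bins - 1)


def correlation_ratio(threads, n_bins=10):
    """Return {(prev_bin, next_bin): C}, with C = p(x_n | x_{n-1}) / p(x_n)."""
    pair_counts = Counter()
    for thread in threads:
        bins = [bin_index(v, n_bins) for v in thread]
        pair_counts.update(zip(bins, bins[1:]))
    total = sum(pair_counts.values())
    prev_counts, next_counts = Counter(), Counter()
    for (prev_bin, next_bin), count in pair_counts.items():
        prev_counts[prev_bin] += count
        next_counts[next_bin] += count
    return {
        (a, b): (count / prev_counts[a]) / (next_counts[b] / total)
        for (a, b), count in pair_counts.items()
    }


def pmi(threads, n_bins=10):
    """Pointwise mutual information, PMI = log C, for every observed pair of bins."""
    return {pair: math.log(c) for pair, c in correlation_ratio(threads, n_bins).items()}
```

Pairs of bins that never occur are simply absent from the returned dictionaries, since their PMI would be undefined.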



Mutual information 



To quantify the amount of mutual dependence between two consecutive comments in a thread one
can also use the concept of mutual information $I(X, Y)$ [21]. It is formally defined for two
discrete random variables $X$ and $Y$ as

$$I(X, Y) = \sum_{y} \sum_{x} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (2)$$

where $p(x, y)$ is the joint probability function of $X$ and $Y$, while $p(x)$ and $p(y)$ are
the marginal probability distribution functions of $X$ and $Y$. In our case the random
variable $X$ is the $P_{pos}$ (or $P_{sub}$) value of the $n$-th comment, while the variable
$Y$ is the $P_{pos}$ (or $P_{sub}$) value of the $(n-1)$-th comment. The results of the
calculations are shown in Tables I and II. One can see that the values obtained for Blogs are
significantly different from those for globally reshuffled data. This suggests that the $n$-th
and $(n-1)$-th comments are not independent of each other. Similar results for BBC Forums are
less pronounced, which is probably related to the tree-like structure of those forums.
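Equation (2) can be evaluated on binned data along the following lines. The bin count and the use of the natural logarithm are assumptions of this sketch (the text does not state either), and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(threads, n_bins=10):
    """I(X, Y) in nats between binned values of consecutive comments."""
    joint = Counter()
    for thread in threads:
        bins = [min(int(v * n_bins), n_bins - 1) for v in thread]
        # x is the bin of the n-th comment, y the bin of the (n-1)-th comment.
        joint.update(zip(bins[1:], bins))
    total = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), count in joint.items():
        px[x] += count
        py[y] += count
    # p(x,y) log[p(x,y) / (p(x) p(y))], written with raw counts.
    return sum(
        (count / total) * math.log(count * total / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )
```

Applying the same function to thread-shuffled and globally shuffled copies of the data gives reference values of the kind compared in the tables.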



No shuffle        Thread shuffle    Global shuffle
Blogs    BBC      Blogs    BBC      Blogs    BBC
4.53     0.41     3.57     0.26     0.05

TABLE I: Mutual information for positive probability value of subsequent comments. The value
of $I(X, Y)$ was calculated with about 0.05 error due to calculation method simplifications.

TABLE II: Mutual information for subjective probability value of subsequent comments. The
value of $I(X, Y)$ was calculated with about 0.05 error due to calculation method
simplifications.

FIG. 5: PMI for all pairs in all threads in the case of $P_{sub}$ for Blogs (upper plot) and
BBC Forum (bottom plot).



VI. THREE-STEP CORRELATIONS 

All subjective positive pairs and negative pairs were found in order to calculate the
three-step correlation. A 0.1 binning was used, so that negative comments had
$P_{pos} \in [0, 0.1]$ and positive comments had $P_{pos} \in [0.9, 1.0]$. Owing to this, it
is possible to extend the previous definition of the probability correlation ratio (1) by

$$C_+(x_n) = \frac{p(x_n \mid x_{n-1} > 0.9,\ x_{n-2} > 0.9)}{p(x_n)} \qquad (3)$$

$$C_-(x_n) = \frac{p(x_n \mid x_{n-1} < 0.1,\ x_{n-2} < 0.1)}{p(x_n)} \qquad (4)$$

where $x_n$ again stands for $P_{pos}(n)$. The quantities $C_+(x_n)$ and $C_-(x_n)$ give the
correlation that, after two positive (negative) posts, the next one will also be positive
(negative). This approach (Fig. 6) was used in order to probe the structure of subjective
clusters.

FIG. 6: Three-step correlation functions $C_+$ (circles) and $C_-$ (squares) for Blogs (upper
plot) and BBC Forums (bottom plot). Grey solid line indicates the level of no correlations.

Figure 6 indicates that positive comments tend to group only with other positive comments.
Negative comment pairs are also positively correlated ($C_-(x_n) > 1$) with the occurrence of
subsequent negative comments, but they are also positively correlated with some unresolved
comments. A common rule for both datasets may be suggested: subjective positive groups in a
thread tend to repel other groups.
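The three-step correlation functions of Eqs. (3) and (4) can be estimated with a short sketch such as the one below. It assumes that $p(x_n)$ is estimated from all comments that have at least two predecessors in their thread and that $x_n$ is binned with width 0.1; the function names are hypothetical.

```python
from collections import Counter


def bin_index(value, n_bins=10):
    """Map a probability in [0, 1] to a bin number 0 .. n_bins - 1."""
    return min(int(value * n_bins), n_bins - 1)


def three_step_correlation(threads, positive=True, n_bins=10):
    """C_+ (positive=True) or C_- (positive=False) for each bin of x_n."""

    def qualifies(value):
        return value > 0.9 if positive else value < 0.1

    all_counts, cond_counts = Counter(), Counter()
    for thread in threads:
        for i in range(2, len(thread)):
            b = bin_index(thread[i], n_bins)
            all_counts[b] += 1
            if qualifies(thread[i - 1]) and qualifies(thread[i - 2]):
                cond_counts[b] += 1
    n_all, n_cond = sum(all_counts.values()), sum(cond_counts.values())
    if n_cond == 0:
        return {}  # no qualifying pairs, the ratio is undefined
    return {
        b: (cond_counts[b] / n_cond) / (all_counts[b] / n_all)
        for b in all_counts
    }
```

Values above 1 for the highest bin in $C_+$ (or the lowest bin in $C_-$) indicate that two positive (negative) posts make a further positive (negative) post more likely than at random.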



VII. CONCLUSIONS 

The analysis performed on the data gathered from Internet blogs and forums shows definite
signs of high correlations in the emotional content of published comments. The difference
between the observed data and simulated values taken from probability distributions gives
evidence of the existence of certain structures. First of all, expressing a specific emotion
in a comment induces, with a large probability, a similar emotion in the next one. This
phenomenon is also seen in the three-step correlation analysis: given that the first two
comments are highly positive/negative, the third one has a tendency to be very
positive/negative as well. Such rules lead to observations of long positive, negative or
objective clusters that far exceed the numbers that would have been obtained if there had been
no correlations in the data. The obtained results are in agreement with our previous study
showing collective emotions and the growth of emotional clusters [14].